Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, we use a fast convolutional decoder, and we amortize the effort to build teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec, which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders with 16.4x less pre-training time; on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time; and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy results in ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
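The two efficiency ideas named above (not encoding masked tokens, and amortizing the teacher computation) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the stand-in `teacher` and `student` functions, the `num_masks` and `mask_ratio` parameters, and the squared-error loss are hypothetical and are not taken from the abstract, which does not describe the implementation.

```python
import random

calls = {"teacher": 0, "student": 0}

def teacher(tokens):
    """Hypothetical stand-in teacher: sees the full, unmasked input."""
    calls["teacher"] += 1
    return [float(t) for t in tokens]

def student(visible):
    """Hypothetical stand-in student: receives only visible (position, token) pairs."""
    calls["student"] += 1
    # One crude prediction (the mean of visible tokens) reused for every masked position.
    return sum(t for _, t in visible) / len(visible)

def training_step(tokens, num_masks=4, mask_ratio=0.5, seed=0):
    rng = random.Random(seed)
    targets = teacher(tokens)  # built once, reused for every masked version
    losses = []
    for _ in range(num_masks):
        positions = list(range(len(tokens)))
        rng.shuffle(positions)
        num_masked = int(len(tokens) * mask_ratio)
        masked, kept = positions[:num_masked], positions[num_masked:]
        # Masked tokens are never passed to the student encoder.
        pred = student([(p, tokens[p]) for p in kept])
        losses.append(sum((pred - targets[p]) ** 2 for p in masked) / len(masked))
    return sum(losses) / len(losses)

loss = training_step([1, 2, 3, 4, 5, 6, 7, 8])
```

In this sketch the teacher runs once per sample while the student runs once per mask, so the teacher cost is spread over `num_masks` student updates.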
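This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through the encoder layers. The decoder then re-orders and decodes the encoded context, padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. The code and models will be available at https://github.com/facebookresearch/audiomae.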
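The mask, encode, and re-order steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Audio-MAE code: the helper names `random_mask` and `restore_order`, the 0.8 masking ratio, and the string stand-ins for patches and encoder outputs are all hypothetical.

```python
import random

def random_mask(num_patches, mask_ratio, seed=0):
    """Split patch indices into (visible, masked) using a random shuffle."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    return sorted(order[:num_keep]), sorted(order[num_keep:])

def restore_order(encoded_visible, visible, num_patches, mask_token):
    """Put encoded visible patches back in place; fill the rest with mask tokens."""
    full = [mask_token] * num_patches
    for idx, vec in zip(visible, encoded_visible):
        full[idx] = vec
    return full

patches = [f"p{i}" for i in range(10)]           # stand-in spectrogram patches
visible, masked = random_mask(len(patches), mask_ratio=0.8)
encoded = [patches[i].upper() for i in visible]  # stand-in for the encoder
decoder_input = restore_order(encoded, visible, len(patches), "[MASK]")
```

Only the visible patches pass through the (stand-in) encoder; the decoder input has the original length again, with mask tokens at the masked positions.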
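Self-supervised learning (SSL) of speech representations has received much attention over the last few years, but most work has focused on languages and domains with an abundance of unlabeled data. For many languages, however, even unlabeled data is in short supply, which limits the effectiveness of SSL. In this work, we focus on the problem of applying SSL to domains with limited available data by leveraging data augmentation for wav2vec 2.0 pretraining. Further, we propose improvements to each component of the model, which together yield a relative word error rate (WER) improvement of up to 13% compared to wav2vec 2.0 on Librispeech test-clean / other.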
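Unsupervised speech recognition has shown great potential to make automatic speech recognition (ASR) systems accessible to every language. However, existing methods still rely heavily on hand-crafted pre-processing. Similar to the trend of making supervised speech recognition end-to-end, we introduce wav2vec-U 2.0, which does away with all audio-side pre-processing and improves accuracy through a better architecture. In addition, we introduce an auxiliary self-supervised objective that ties model predictions back to the input. Experiments show that wav2vec-U 2.0 improves unsupervised recognition results across different languages while being conceptually simpler.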
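This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high- and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice and VoxPopuli, lowering error rates by 14-34% relative. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pretraining can outperform English-only pretraining when translating English speech into other languages, a setting which favors monolingual pretraining. We hope XLS-R can help to improve speech processing tasks for many more languages of the world.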
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data. Code and models are available at https://github.com/pytorch/fairseq
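A contrastive task of the kind mentioned above asks the model to pick the true quantized target for a masked position out of a set of distractors. The sketch below shows one common formulation; the cosine similarity, the `temperature` value, and the toy two-dimensional vectors are assumptions for illustration, since the abstract does not specify them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of the true target among target + distractors."""
    logits = [cosine(context, target) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]

context = [1.0, 0.0]                        # context output at a masked step
target = [0.9, 0.1]                         # true quantized latent (toy values)
distractors = [[0.0, 1.0], [-1.0, 0.0]]     # quantized latents from other steps
loss_good = contrastive_loss(context, target, distractors)
loss_bad = contrastive_loss(context, distractors[1], [target, distractors[0]])
```

The loss is small when the context vector is closest to the true target and large when a distractor is a better match.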
FAIRSEQ is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision training and inference on modern GPUs. A demo video can be found here: https://www.youtube.com/watch?v=OtgDdWtHvto.
In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and effective semi-supervised training method that leverages unlabeled video data. We start with predicted 2D keypoints for unlabeled video, then estimate 3D poses and finally back-project to the input 2D keypoints. In the supervised setting, our fully-convolutional model outperforms the previous best result from the literature by 6 mm mean per-joint position error on Human3.6M, corresponding to an error reduction of 11%, and the model also shows significant improvements on HumanEva-I. Moreover, experiments with back-projection show that it comfortably outperforms previous state-of-the-art results in semi-supervised settings where labeled data is scarce. Code and models are available at https://github.com/facebookresearch/VideoPose3D
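The back-projection idea above (estimate 3D, project back to 2D, compare with the input keypoints) can be sketched as a reprojection error. This is a simplified illustration, not the paper's loss: the pinhole camera model, the `focal` and `center` parameters, and the function names are assumptions introduced here.

```python
import math

def project(point3d, focal=1.0, center=(0.0, 0.0)):
    """Pinhole projection of a 3D point in camera coordinates (z > 0) to 2D."""
    x, y, z = point3d
    return (focal * x / z + center[0], focal * y / z + center[1])

def reprojection_error(pose3d, keypoints2d, focal=1.0, center=(0.0, 0.0)):
    """Mean Euclidean distance between projected joints and input 2D keypoints."""
    total = 0.0
    for joint, (u, v) in zip(pose3d, keypoints2d):
        pu, pv = project(joint, focal, center)
        total += math.hypot(pu - u, pv - v)
    return total / len(pose3d)

pose3d = [(0.0, 0.0, 2.0), (1.0, 1.0, 2.0)]   # toy estimated 3D joints
consistent_2d = [(0.0, 0.0), (0.5, 0.5)]      # 2D keypoints matching the pose
shifted_2d = [(0.0, 0.0), (0.5, 1.0)]         # second keypoint moved by 0.5

err_consistent = reprojection_error(pose3d, consistent_2d)
err_shifted = reprojection_error(pose3d, shifted_2d)
```

On unlabeled video no 3D ground truth is needed for this term: the error depends only on the estimated 3D pose and the 2D keypoints that were the model's input.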
In open-domain dialogue, intelligent agents should exhibit the use of knowledge; however, there are few convincing demonstrations of this to date. The most popular sequence to sequence models typically "generate and hope" generic utterances that can be memorized in the weights of the model when mapping from input utterance(s) to output, rather than employing recalled knowledge as context. Use of knowledge has so far proved difficult, in part because of the lack of a supervised learning benchmark task which exhibits knowledgeable open dialogue with clear grounding. To that end we collect and release a large dataset with conversations directly grounded with knowledge retrieved from Wikipedia. We then design architectures capable of retrieving knowledge, reading and conditioning on it, and finally generating natural responses. Our best performing dialogue models are able to conduct knowledgeable discussions on open-domain topics as evaluated by automatic metrics and human evaluations, while our new benchmark allows for measuring further improvements in this important research direction.
The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural networks. Compared to recurrent models, computations over all elements can be fully parallelized during training to better exploit the GPU hardware and optimization is easier since the number of non-linearities is fixed and independent of the input length. Our use of gated linear units eases gradient propagation and we equip each decoder layer with a separate attention module. We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.
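A gated linear unit, as referred to above, multiplies one half of a layer's output by a sigmoid gate computed from the other half. The sketch below uses plain Python lists; the convention of splitting a single vector into a content half and a gate half is an assumption about layout made for illustration.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def glu(values):
    """Gated linear unit: first half is content, second half controls the gate."""
    if len(values) % 2 != 0:
        raise ValueError("GLU input length must be even")
    half = len(values) // 2
    content, gate = values[:half], values[half:]
    return [a * sigmoid(b) for a, b in zip(content, gate)]

half_open = glu([2.0, 0.0])                # gate of 0 passes half the content
mixed = glu([1.0, 3.0, 100.0, -100.0])     # first gate open, second closed
```

Because the content path is linear, the gradient through an open gate is not squashed by a saturating non-linearity, which is the property the abstract credits with easing gradient propagation.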